Nobody knows why the rank frequency distribution of instructions in code follows Zipf's law.
But does it have to be this way?


All software is made up of the instructions of some programming language. Instructions,
comprising operators and core functions, are the atomic elements of code that do things
such as adding two numbers or calculating the length of a sequence of characters.


If we look at any large piece of software it is clear that a few instructions are very common
while many others are hardly ever used. Plotting the frequency of each instruction in
descending order consistently gives approximately the same inverse power law distribution, in
which the frequency of an instruction is roughly inversely proportional to its rank. This is
called Zipf's law.
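
The rank frequency idea can be tried out on a small scale. The following is a minimal sketch,
not the method used to obtain the observations described here. It assumes Python source as
the input, uses the standard ast module as a stand-in for an instruction extractor, and treats
operators and directly named function calls as "instructions". The names instruction_counts,
rank_frequency and zipf_exponent are hypothetical, and the least-squares fit on log-log values
is only a crude way to estimate the exponent.

```python
import ast
import math
from collections import Counter


def instruction_counts(source: str) -> Counter:
    """Count operators and directly named function calls in Python source text.

    Assumption: these two kinds of node are a reasonable proxy for 'instructions'.
    """
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.operator, ast.unaryop, ast.cmpop, ast.boolop)):
            counts[type(node).__name__] += 1
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            counts[node.func.id] += 1
    return counts


def rank_frequency(counts):
    """Return (rank, instruction, frequency) tuples in descending order of frequency."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, name, freq) for rank, (name, freq) in enumerate(ordered, start=1)]


def zipf_exponent(frequencies):
    """Estimate s in frequency ~ 1 / rank**s by least squares on log-log values."""
    freqs = sorted(frequencies, reverse=True)
    if len(freqs) < 2 or min(freqs) <= 0:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # the fitted slope is negative, so negate it


if __name__ == "__main__":
    sample = "a = 1 + 2 + 3\nb = len(str(a)) + len('x')\n"
    for rank, name, freq in rank_frequency(instruction_counts(sample)):
        print(rank, name, freq)
```

A sample this small says nothing about Zipf's law itself. The distribution only becomes
visible when the counts are taken over a large body of code, where an estimated exponent
close to 1 would correspond to the classic form of the law.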


Zipf's law was originally noted in the context of linguistics where it applies to the word 
frequency of texts in human languages. Curiously, it is also observed in artificial languages 
as well as in a wide range of human-created systems, human and animal behaviour, and 
natural phenomena. These include the distribution of share prices, the populations of cities 
and even the sizes of craters on the moon. Various explanations have been proffered for 
why this law is so ubiquitous but it remains a mystery. A single cause for such a wide 
range of distributions would seem unlikely. Also, the tautological view that 'the most
common are the most common ...' does nothing to explain the shape of the distribution.


In software development there are many possible factors that might influence the
frequency of use of instructions in programs. Different problem domains and programming
languages might be expected to impact the distribution of instruction frequency yet this is
not the case. The underlying machine instruction set certainly has a strong influence on
the availability of programming language instructions but does not in itself dictate how they
should be used. This leaves a range of mostly social considerations including developer
education, collaboration and various forms of knowledge sharing. In effect, the relative
frequency of instructions used in code is most likely to be a set of norms that are
developed, disseminated and held collectively across teams, and ultimately the entire
software development community. As with human language it could be that focussing on a
small vocabulary reduces cognitive load. In any event repeated reuse of the same patterns
utilising mostly the same subset of instructions is a self-reinforcing feedback process. This
can also be viewed as an example of the golden hammer principle.


In any programming language there are often many different ways to write the same 
program. For a given problem most developers will produce some variation of a standard 
solution - using mostly identical instructions. At the same time it is also possible to produce 
many functionally equivalent programs that are very different, and most of these will use 
different subsets of instructions. Some of these solutions will be better or worse than the 
standard ones in some respect such as performance. However, the non-standard solutions 
will exhibit a much larger degree of variety. This increased variety is certain to include 
different and sometimes better algorithms. Human bias with respect to instruction 
frequency implicitly constrains how we develop code and means that we are missing out
on possible advances in software.
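
As a small illustration of functionally equivalent programs that draw on different
instructions, the sketch below sums the integers from 1 to n in two ways. The loop version
is the kind of standard solution most developers would write, and the closed-form version
reaches the same result with a different set of operators and in constant time. The example
and its names (LOOP_SRC, FORMULA_SRC, operators_used) are hypothetical and chosen only for
illustration. They are not taken from Zoea.

```python
import ast

# Standard solution: iterate and accumulate with addition.
LOOP_SRC = """
def sum_to_loop(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total
"""

# Non-standard solution: closed-form arithmetic, no loop at all.
FORMULA_SRC = """
def sum_to_formula(n):
    return n * (n + 1) // 2
"""


def operators_used(source: str) -> set:
    """Return the names of the arithmetic and comparison operators in the source."""
    return {
        type(node).__name__
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.operator, ast.cmpop))
    }


# Define both functions from their source text so that the same text can be
# both executed and inspected.
namespace = {}
exec(LOOP_SRC, namespace)
exec(FORMULA_SRC, namespace)
sum_to_loop = namespace["sum_to_loop"]
sum_to_formula = namespace["sum_to_formula"]

if __name__ == "__main__":
    print(sum_to_loop(100), sum_to_formula(100))
    print(sorted(operators_used(LOOP_SRC)))
    print(sorted(operators_used(FORMULA_SRC)))
```

Both functions agree for every non-negative n, yet the second relies on multiplication and
floor division, which the first never uses.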


If Zipf's law in software development is largely a form of human bias then will this also
apply to software that is produced by an AI? The answer is probably 'yes' if the AI is
produced through training on a set of example programs produced by humans. However, it
doesn't have to be this way. Zoea is an AI that is built using explicit software development
knowledge rather than through training with a set of examples. As a result it is not so
constrained in terms of which instructions it uses in the code it creates. Taking this
approach increases the potential for AI generated software to produce innovative and
optimal results.


In a sense Zipf's law is a manifestation of the status quo in software development.
Developer bias with respect to instruction usage is an expedient that likely makes it easier
for people to learn and communicate about software development. However, it constrains
our ability to deliver better solutions or even any solution to some problems. It probably
isn't possible for human developers to operate any other way. AI coders like Zoea on the
other hand have the potential to move beyond instruction frequency bias.


Learn more at zoea.co.uk 